perf: optimize array_remove for scalar needle#22390
Conversation
neilconway
left a comment
There was a problem hiding this comment.
Nice performance win!
| ); | ||
| } | ||
| }; | ||
| let original_data = list_array.values().to_data(); |
There was a problem hiding this comment.
This will be inefficient for sliced arrays.
There was a problem hiding this comment.
I now slice the values to the range actually referenced by the offsets.
That said, I wanted to understand your concern better: when a GenericListArray is sliced, values() returns the full underlying array, and to_data() on it wraps the existing buffer references into ArrayData without copying. So the main downside I could identify is that Capacities::Array(original_data.len()) over-estimates the pre-allocation for sliced inputs. Were you thinking of a different inefficiency, or is the over-allocation what you had in mind?
There was a problem hiding this comment.
The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).
| let list_array = array.as_list::<i64>(); | ||
| general_remove_with_scalar::<i64>(list_array, needle, arr_n) | ||
| } | ||
| array_type => exec_err!("array_remove does not support type '{array_type}'."), |
There was a problem hiding this comment.
This is called by more than just array_remove; can we improve the error message?
| for (i, keep) in eq_array.iter().enumerate() { | ||
| if keep == Some(false) && removed < max_removals { | ||
| if let Some(bs) = pending_batch_to_retain { | ||
| mutable.extend(0, start + bs, start + i); | ||
| copied += i - bs; | ||
| pending_batch_to_retain = None; | ||
| } | ||
| removed += 1; | ||
| } else if pending_batch_to_retain.is_none() { | ||
| pending_batch_to_retain = Some(i); | ||
| } | ||
| } |
There was a problem hiding this comment.
I wonder if it would be possible to iterate only over the "false" bits, e.g., by negating the buffer and looking at BooleanBuffer::set_indices.
There was a problem hiding this comment.
Great suggestion. Benchmarks show a ~20–40% improvement with this optimization.
| let mut copied = 0usize; | ||
| let mut pending_batch_to_retain: Option<usize> = None; | ||
| for (i, keep) in eq_array.iter().enumerate() { | ||
| if keep == Some(false) && removed < max_removals { |
There was a problem hiding this comment.
Can we break from the loop once we hit max_removals?
There was a problem hiding this comment.
Good point. now break early once max_removals is reached.
| // Iterate only over the positions that need removal using set_indices, | ||
| // which is more efficient than scanning every bit. |
There was a problem hiding this comment.
Might be worth elaborating that the win here is mostly because we expect the # of values-to-remove is a lot smaller than the total array size, which it usually (but not always) will be.
| let keep_mask = | ||
| arrow_ord::cmp::distinct(list_array.values(), &Scalar::new(Arc::clone(needle)))?; |
There was a problem hiding this comment.
This will call the distinct kernel on all the elements in the value buffer, not just the ones that are visible in a sliced array.
| ); | ||
| } | ||
| }; | ||
| let original_data = list_array.values().to_data(); |
There was a problem hiding this comment.
The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).
neilconway
left a comment
There was a problem hiding this comment.
Awesome, nice work on this! lgtm. Maybe update the benchmark numbers in the PR description if you get a chance.
|
Thanks @neilconway! Really appreciate the thorough and speedy review 🙏 Great suggestion — running benchmarks now and will update the pr description with fresh numbers shortly. |
| /// | ||
| /// Uses a bulk `distinct` comparison for non-null, non-nested scalar elements, | ||
| /// falling back to the per-row `general_remove` path for null or nested types. | ||
| fn remove_with_scalar_needle( |
There was a problem hiding this comment.
I find this function a little confusing, since the code paths can then be like:
invoke with args -> remove_with_scalar_needle -> array_remove_internal (fallback)
Perhaps we can hoist this null/nested check earlier to remove need for this function? Or the benefits of centralizing the check outweighs it?
| let num_rows = args.number_rows; | ||
| let list_array = list_arg.to_array(num_rows)?; | ||
| let max_array = max_arg.to_array(num_rows)?; | ||
| let arr_n = as_int64_array(&max_array)?.values().to_vec(); |
There was a problem hiding this comment.
We're ignoring nulls in max_arg here, though I think this may be an existing issue
| let values_range_len = last_offset - first_offset; | ||
| let values_slice = list_array.values().slice(first_offset, values_range_len); | ||
| let original_data = values_slice.to_data(); | ||
| let mut offsets = Vec::<OffsetSize>::with_capacity(list_array.len() + 1); |
| let mut offsets = Vec::<OffsetSize>::with_capacity(list_array.len() + 1); | ||
| offsets.push(OffsetSize::zero()); | ||
|
|
||
| let mut mutable = MutableArrayData::with_capacities( |
There was a problem hiding this comment.
I wonder if an approach using take kernel could provide even more performance gains?
| let nulls = list_array.nulls().cloned(); | ||
| let keep_mask = | ||
| arrow_ord::cmp::distinct(&values_slice, &Scalar::new(Arc::clone(needle)))?; | ||
| let remove_bits = match keep_mask.nulls() { |
There was a problem hiding this comment.
distinctis similar toneq, only differing in null handling. In particular, two
operands are considered DISTINCT if they have a different value or if one of them is NULL
and the other isn’t. The result ofdistinctis never NULL.
https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.distinct.html
I don't think we should have handling around the null buffer of keep_mask according to the documentation
| needle: &ArrayRef, | ||
| arr_n: &[i64], |
There was a problem hiding this comment.
Similar question to that of array_replace PR, in if we should just handle when needle & max are scalars only
Which issue does this PR close?
Rationale for this change
Similar to #22387 (array_replace scalar optimization)
array_remove/array_remove_n/array_remove_allperform element-wise comparison by invokingcompare_element_to_listagainst each row's sub-array individually. When the needle is a scalar, this can be optimized by performing a single vectorizeddistinctcomparison over the entire flattened values buffer.What changes are included in this PR?
general_remove_with_scalar) that usesarrow_ord::cmp::distinctwithScalarwrapper for a single bulk comparison pass over the flat values buffer.nvalues, and LargeList type coverage.Benchmarks
Are these changes tested?
Yes, existing and new SLT edge-case tests in
array_remove.slt.Are there any user-facing changes?
No.